Customer Segmentation using Clustering


This mini-project is based on this blog post by yhat. Please feel free to refer to the post for additional information and solutions.


In [1]:
%matplotlib inline
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

# Setup Seaborn
sns.set_style("whitegrid")
sns.set_context("poster")

Data

The dataset contains information on marketing newsletters/e-mail campaigns (e-mail offers sent to customers) and transaction level data from customers. The transactional data shows which offer customers responded to, and what the customer ended up buying. The data is presented as an Excel workbook containing two worksheets. Each worksheet contains a different dataset.


In [2]:
df_offers = pd.read_excel("./WineKMC.xlsx", sheet_name=0)
df_offers.columns = ["offer_id", "campaign", "varietal", "min_qty", "discount", "origin", "past_peak"]
df_offers.head()


Out[2]:
offer_id campaign varietal min_qty discount origin past_peak
0 1 January Malbec 72 56 France False
1 2 January Pinot Noir 72 17 France False
2 3 February Espumante 144 32 Oregon True
3 4 February Champagne 72 48 France True
4 5 February Cabernet Sauvignon 144 44 New Zealand True

We see that the first dataset contains information about each offer such as the month it is in effect and several attributes about the wine that the offer refers to: the variety, minimum quantity, discount, country of origin and whether or not it is past peak. The second dataset in the second worksheet contains transactional data -- which offer each customer responded to.


In [3]:
df_transactions = pd.read_excel("./WineKMC.xlsx", sheet_name=1)
df_transactions.columns = ["customer_name", "offer_id"]
df_transactions['n'] = 1
df_transactions.head()


Out[3]:
customer_name offer_id n
0 Smith 2 1
1 Smith 24 1
2 Johnson 17 1
3 Johnson 24 1
4 Johnson 26 1

Data wrangling

We're trying to learn more about how our customers behave, so we can use their behavior (whether or not they purchased something based on an offer) as a way to group similar minded customers together. We can then study those groups to look for patterns and trends which can help us formulate future offers.

The first thing we need is a way to compare customers. To do this, we're going to create a matrix that contains each customer and a 0/1 indicator for whether or not they responded to a given offer.

Checkup Exercise Set I

Exercise: Create a data frame where each row has the following columns (Use the pandas [`merge`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html) and [`pivot_table`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html) functions for this purpose):

  • customer_name
  • One column for each offer, with a 1 if the customer responded to the offer

Make sure you also deal with any weird values such as `NaN`. Read the documentation to develop your solution.


In [4]:
#your turn

In [5]:
df = df_transactions.merge(df_offers, on='offer_id')
df = df.pivot_table(index='customer_name', columns='offer_id', values='n', fill_value=0)
df.head()


Out[5]:
offer_id 1 2 3 4 5 6 7 8 9 10 ... 23 24 25 26 27 28 29 30 31 32
customer_name
Adams 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 1 0 0
Allen 0 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 1 0 0 0 0 0
Anderson 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 1 0 0 0 0 0 0
Bailey 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
Baker 0 0 0 0 0 0 1 0 0 1 ... 0 0 0 0 0 0 0 0 1 0

5 rows × 32 columns

K-Means Clustering

Recall that in K-Means Clustering we want to maximize the distance between centroids and minimize the distance between data points and the respective centroid for the cluster they are in. True evaluation for unsupervised learning would require labeled data; however, we can use a variety of intuitive metrics to try to pick the number of clusters K. We will introduce three methods: the Elbow method, the Silhouette method, and the gap statistic.

Choosing K: The Elbow Sum-of-Squares Method

The first method looks at the sum-of-squares error in each cluster against $K$. We compute the distance from each data point to the center of the cluster (centroid) to which the data point was assigned.

$$SS = \sum_k \sum_{x_i \in C_k} \left( x_i - \mu_k \right)^2$$

where $x_i$ is a point, $C_k$ represents cluster $k$ and $\mu_k$ is the centroid for cluster $k$. We can plot SS vs. $K$ and choose the elbow point in the plot as the best value for $K$. The elbow point is the point at which the plot starts descending much more slowly.

Checkup Exercise Set II

Exercise:

  • What values of $SS$ do you believe represent better clusterings? Why?
  • Create a numpy matrix `x_cols` with only the columns representing the offers (i.e. the 0/1 columns)
  • Write code that applies the [`KMeans`](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) clustering method from scikit-learn to this matrix.
  • Construct a plot showing $SS$ for each $K$ and pick $K$ using this plot. For simplicity, test $2 \le K \le 10$.
  • What challenges did you experience using the Elbow method to pick $K$?
  • Make a bar chart showing the number of points in each cluster for k-means under the best $K$.

In [6]:
# your turn

The sum of squares (SS) measures the total squared distance from each data point to the centroid of its cluster. A lower SS within a cluster means a tighter, more distinct cluster. The ideal outcome would be large distances between different clusters combined with a relatively small SS for each cluster, but this is unlikely in practice. There is a trade-off between minimizing SS and the number of clusters: we can always decrease SS by increasing the number of clusters, yet we also want to keep the number of clusters as low as possible. The elbow method is a way to balance this trade-off when choosing the best K (number of clusters).


In [7]:
import numpy as np

print(df.columns)
# Every remaining column is a 0/1 offer indicator (customer_name is the index),
# so the whole frame serves as the feature matrix.
x_cols = np.matrix(df[df.columns])
x_cols


Int64Index([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
            18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32],
           dtype='int64', name='offer_id')
Out[7]:
matrix([[0, 0, 0, ..., 1, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ..., 
        [1, 0, 0, ..., 0, 1, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 1, 1]], dtype=int64)

In [8]:
from sklearn.cluster import KMeans

inertias = []
ks = range(2, 11)

# Fit K-Means for each candidate K and record the within-cluster
# sum of squares (exposed by scikit-learn as inertia_).
for k in ks:
    kMeans_model = KMeans(n_clusters=k)
    kMeans_model.fit(x_cols)
    inertias.append(kMeans_model.inertia_)
    
plt.plot(ks, inertias)
plt.xlabel("K")
plt.ylabel("SS")
plt.title("K vs SS")
plt.show()


Using this plot, it is difficult to apply the elbow method and choose the best K. The plot is almost linear and there is no clear "elbow". We can arbitrarily choose 5 clusters for now, but a better method to choose K should be used.


In [9]:
import collections

kMeans_model = KMeans(n_clusters=5, random_state=5)
clusters = kMeans_model.fit_predict(x_cols)

count_by_cluster = collections.OrderedDict(sorted(collections.Counter(clusters).items()))

plt.bar(list(count_by_cluster.keys()), list(count_by_cluster.values()))
plt.title("Data Points in Each Cluster")
plt.xlabel("Cluster")
plt.ylabel("Number of Data Points")

for k, v in count_by_cluster.items():
    print(k,":", v)


0 : 19
1 : 28
2 : 16
3 : 24
4 : 13

Choosing K: The Silhouette Method

There exists another method that measures how well each datapoint $x_i$ "fits" its assigned cluster and also how poorly it fits into other clusters. This is a different way of looking at the same objective. Denote $a_{x_i}$ as the average distance from $x_i$ to all other points within its own cluster $k$. The lower the value, the better. On the other hand $b_{x_i}$ is the minimum average distance from $x_i$ to points in a different cluster, minimized over clusters. That is, compute separately for each cluster the average distance from $x_i$ to the points within that cluster, and then take the minimum. The silhouette $s(x_i)$ is defined as

$$s(x_i) = \frac{b_{x_i} - a_{x_i}}{\max{\left( a_{x_i}, b_{x_i}\right)}}$$

The silhouette score is computed on every datapoint in every cluster. The silhouette score ranges from -1 (a poor clustering) to +1 (a very dense clustering), with 0 denoting the situation where clusters overlap. Some criteria for interpreting the silhouette coefficient are provided in the table below.


| Range | Interpretation |
|---|---|
| 0.71 - 1.0 | A strong structure has been found. |
| 0.51 - 0.7 | A reasonable structure has been found. |
| 0.26 - 0.5 | The structure is weak and could be artificial. |
| < 0.25 | No substantial structure has been found. |

Source: http://www.stat.berkeley.edu/~spector/s133/Clus.html

Fortunately, scikit-learn provides a function to compute this for us (phew!) called sklearn.metrics.silhouette_score. Take a look at this article on picking $K$ in scikit-learn, as it will help you in the next exercise set.
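
To make the silhouette formula concrete, here is a small sketch (not part of the original notebook) that computes $s(x_i)$ by hand for a single point of the offer matrix built earlier and checks it against scikit-learn's silhouette_samples. It reuses the x_cols matrix from the cell above; the choice of 5 clusters and the random_state are arbitrary.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
from sklearn.metrics.pairwise import euclidean_distances

X = np.asarray(x_cols)                        # 0/1 offer matrix from the earlier cell
labels = KMeans(n_clusters=5, random_state=5).fit_predict(X)

i = 0                                         # an arbitrary data point
dists = euclidean_distances(X[i].reshape(1, -1), X).ravel()

# a: average distance from x_i to the other points in its own cluster
same_cluster = (labels == labels[i])
a = dists[same_cluster & (np.arange(len(X)) != i)].mean()

# b: smallest average distance from x_i to the points of any other cluster
b = min(dists[labels == k].mean() for k in set(labels) if k != labels[i])

s_manual = (b - a) / max(a, b)
print(s_manual, silhouette_samples(X, labels)[i])   # the two values should agree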

Checkup Exercise Set III

Exercise: Using the documentation for the `silhouette_score` function above, construct a series of silhouette plots like the ones in the article linked above.

Exercise: Compute the average silhouette score for each $K$ and plot it. What $K$ does the plot suggest we should choose? Does it differ from what we found using the Elbow method?


In [10]:
# Your turn.

In [11]:
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm

silhouette_score_by_k = {}

for k in ks:
    fig, (ax1) = plt.subplots(1, 1)

    ax1.set_xlim([-0.2, 0.6])
    ax1.set_ylim([0, len(x_cols) + (k + 1) * 10])

    clusterer = KMeans(n_clusters=k, random_state=10)
    cluster_labels = clusterer.fit_predict(x_cols)

    silhouette_avg = silhouette_score(x_cols, cluster_labels)
    print("For n_clusters =", k,
          "The average silhouette_score is :", silhouette_avg)
    silhouette_score_by_k[k] = silhouette_avg

    sample_silhouette_values = silhouette_samples(x_cols, cluster_labels)

    y_lower = 10
    for i in range(k):
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / k)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        y_lower = y_upper + 10

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([]) 
    ax1.set_xticks([-0.2, -0.1, 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % k),
                 fontsize=14, fontweight='bold')

    plt.show()


For n_clusters = 2 The average silhouette_score is : 0.0936557328349
For n_clusters = 3 The average silhouette_score is : 0.118899428636
For n_clusters = 4 The average silhouette_score is : 0.123470539196
For n_clusters = 5 The average silhouette_score is : 0.14092516242
For n_clusters = 6 The average silhouette_score is : 0.137179893911
For n_clusters = 7 The average silhouette_score is : 0.116109245662
For n_clusters = 8 The average silhouette_score is : 0.113395738326
For n_clusters = 9 The average silhouette_score is : 0.125059605278
For n_clusters = 10 The average silhouette_score is : 0.119283321348

In [12]:
for k, score in silhouette_score_by_k.items():
    print("K =", k, " => ", "silhouette score =", score)


K = 2  =>  silhouette score = 0.0936557328349
K = 3  =>  silhouette score = 0.118899428636
K = 4  =>  silhouette score = 0.123470539196
K = 5  =>  silhouette score = 0.14092516242
K = 6  =>  silhouette score = 0.137179893911
K = 7  =>  silhouette score = 0.116109245662
K = 8  =>  silhouette score = 0.113395738326
K = 9  =>  silhouette score = 0.125059605278
K = 10  =>  silhouette score = 0.119283321348

The silhouette scores are all lower than 0.25, which indicates that no substantial structure has been found. The silhouette method may therefore not be a reliable way to choose the best K here. That said, based on this method the best number of clusters would be 5, with 6 clusters a close second.

Choosing $K$: The Gap Statistic

There is one last method worth covering for picking $K$, the so-called Gap statistic. The computation for the gap statistic builds on the sum-of-squares established in the Elbow method discussion, and compares it to the sum-of-squares of a "null distribution," that is, a random set of points with no clustering. The estimate for the optimal number of clusters $K$ is the value for which $\log{SS}$ falls the farthest below that of the reference distribution:

$$G_k = E_n^*\{\log SS_k\} - \log SS_k$$

In other words, a good clustering yields a much larger difference between the reference distribution and the clustered data. The reference distribution is built by a Monte Carlo (randomization) procedure that constructs $B$ random distributions of points within the bounding box (limits) of the original data and then applies K-means to each synthetic distribution of data points. $E_n^*\{\log SS_k\}$ is just the average of $\log SS_k$ over all $B$ replicates. We then compute the standard deviation $\sigma_{SS}$ of the $\log SS_k$ values computed from the $B$ replicates of the reference distribution and compute

$$s_k = \sqrt{1+1/B}\sigma_{SS}$$

Finally, we choose $K=k$ such that $G_k \geq G_{k+1} - s_{k+1}$.
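
Scikit-learn does not ship a gap-statistic function, so below is a rough sketch of the procedure described above (our own code, not from the original post): it draws $B$ uniform reference datasets inside the bounding box of the offer matrix, compares $\log SS_k$ on the real and reference data, and applies the rule $G_k \geq G_{k+1} - s_{k+1}$. The helper name gap_statistic and the choice of $B$ are arbitrary; x_cols and ks come from the earlier cells.

import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, ks, B=10, random_state=0):
    rng = np.random.RandomState(random_state)
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)

    gaps, sks = [], []
    for k in ks:
        # log(SS_k) on the real data
        log_ss = np.log(KMeans(n_clusters=k, random_state=random_state).fit(X).inertia_)

        # log(SS_k) on B uniform reference data sets drawn inside the bounding box
        ref_log_ss = []
        for _ in range(B):
            X_ref = rng.uniform(mins, maxs, size=X.shape)
            ref_log_ss.append(np.log(KMeans(n_clusters=k, random_state=random_state)
                                     .fit(X_ref).inertia_))
        ref_log_ss = np.array(ref_log_ss)

        gaps.append(ref_log_ss.mean() - log_ss)               # G_k
        sks.append(np.sqrt(1 + 1.0 / B) * ref_log_ss.std())   # s_k

    return np.array(gaps), np.array(sks)

gaps, sks = gap_statistic(x_cols, ks)
# pick the smallest K with G_k >= G_{k+1} - s_{k+1}
best_k = next((k for k, g, g_next, s_next in zip(ks, gaps, gaps[1:], sks[1:])
               if g >= g_next - s_next), list(ks)[-1])
print("Gap statistic suggests K =", best_k)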

Aside: Choosing $K$ when we Have Labels

Unsupervised learning assumes that we do not have labels. In some situations, however, we may wish to cluster data that is already labeled, and computing the optimal number of clusters is much easier when we have access to labels. There are several methods available. We will not go into the math or details, since it is rare to have access to labels, but we provide the names and references of these measures.

  • Adjusted Rand Index
  • Mutual Information
  • V-Measure
  • Fowlkes–Mallows index

See this article for more information about these metrics.
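
All four measures are available in sklearn.metrics. As a quick illustration on made-up labels (unrelated to this dataset; Mutual Information is shown via its adjusted variant):

from sklearn.metrics import (adjusted_rand_score, adjusted_mutual_info_score,
                             v_measure_score, fowlkes_mallows_score)

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]   # hypothetical ground-truth labels
pred_labels = [0, 0, 1, 1, 1, 1, 2, 2, 0]   # hypothetical cluster assignments

print("Adjusted Rand Index :", adjusted_rand_score(true_labels, pred_labels))
print("Adjusted Mutual Info:", adjusted_mutual_info_score(true_labels, pred_labels))
print("V-Measure           :", v_measure_score(true_labels, pred_labels))
print("Fowlkes-Mallows     :", fowlkes_mallows_score(true_labels, pred_labels))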

Visualizing Clusters using PCA

How do we visualize clusters? If we only had two features, we could likely plot the data as is. But we have 100 data points each containing 32 features (dimensions). Principal Component Analysis (PCA) will help us reduce the dimensionality of our data from 32 to something lower. For a visualization on the coordinate plane, we will use 2 dimensions. In this exercise, we're going to use it to transform our multi-dimensional dataset into a 2 dimensional dataset.

This is only one use of PCA for dimension reduction. We can also use PCA when we want to perform regression but we have a set of highly correlated variables. PCA untangles these correlations into a smaller number of features/predictors all of which are orthogonal (not correlated). PCA is also used to reduce a large set of variables into a much smaller one.
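
As a side note, here is a minimal sketch of that regression use case on synthetic data (our own construction, separate from the wine dataset): ten highly correlated columns are reduced to 2 principal components before fitting a linear regression.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
signals = rng.normal(size=(200, 2))
# five noisy copies of the two underlying signals -> 10 strongly correlated columns
X = np.hstack([signals + 0.05 * rng.normal(size=signals.shape) for _ in range(5)])
y = signals[:, 0] - 2 * signals[:, 1] + 0.1 * rng.normal(size=200)

model = make_pipeline(PCA(n_components=2), LinearRegression())
model.fit(X, y)
print("R^2 on the training data:", model.score(X, y))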

Checkup Exercise Set IV

Exercise: Use PCA to plot your clusters:

  • Use scikit-learn's [`PCA`](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) function to reduce the dimensionality of your clustering data to 2 components
  • Create a data frame with the following fields:
    • customer name
    • cluster id the customer belongs to
    • the two PCA components (label them `x` and `y`)
  • Plot a scatterplot of the `x` vs `y` columns
  • Color-code points differently based on cluster ID
  • How do the clusters look?
  • Based on what you see, what seems to be the best value for $K$? Moreover, which method of choosing $K$ seems to have produced the optimal result visually?

Exercise: Now look at both the original raw data about the offers and transactions and look at the fitted clusters. Tell a story about the clusters in context of the original data. For example, do the clusters correspond to wine variants or something else interesting?


In [13]:
#your turn

In [14]:
from sklearn.decomposition import PCA

pca_model = PCA(n_components=2, random_state=5)
pca_features = pca_model.fit_transform(x_cols)
x = pca_features[:, 0]
y = pca_features[:, 1]

# The set_index/reset_index pair simply moves customer_name into the first column
df_pca = pd.DataFrame({'customer_name': df.index,
                       'cluster': clusters,
                       'x': x,
                       'y': y}).set_index('customer_name').reset_index()

df_pca.head()


Out[14]:
customer_name cluster x y
0 Adams 3 1.007580 0.108215
1 Allen 1 -0.287539 0.044715
2 Anderson 2 -0.392032 1.038391
3 Bailey 3 0.699477 -0.022542
4 Baker 1 0.088183 -0.471695

What we've done is take those columns of 0/1 indicator variables and transform them into a 2-D dataset. We took one component and arbitrarily called it x, and called the other y. Now we can throw each point into a scatterplot. We color-coded each point based on its cluster so it's easier to see them.


In [15]:
plt.scatter(x, y, c=clusters, cmap="Set1")
plt.show()


These clusters are based on the original 0/1 data. The orange and brown data points on the top and right areas seem to be distinct clusters. For the middle and bottom left region, the clusters overlap. This area is not very clear. It could be one big cluster or it could be split up into several smaller clusters. Overall, clustering is very subjective. Looking at the two methods we have used to try to find the best K, we determined that neither method gave us a clear answer. Next we can look at clustering using the PCA features.


In [16]:
fig = plt.figure()
i = 1
# Re-fit K-Means for each candidate K on the 2-D PCA features and plot the labels
for k in ks:
    model = KMeans(n_clusters=k, random_state=5)
    cluster_labels = model.fit_predict(pca_features)

    ax = fig.add_subplot(3, 3, i)
    plt.scatter(x, y, c=cluster_labels, cmap="Set1", s=30)
    plt.title("k = " + str(k))
    ax.set_yticks([])
    ax.set_xticks([])
    i += 1

plt.tight_layout()
plt.show()

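For the story-telling part of the exercise above, one possible approach (a sketch, not the original notebook's answer) is to attach each customer's K-Means cluster label back onto the transactions and summarize the offers each cluster responded to, for example by varietal, minimum quantity and discount:

# Relate the fitted clusters back to the original offers: attach the K=5
# K-Means cluster of each customer to their transactions and tally varietals.
df_clusters = pd.DataFrame({'customer_name': df.index, 'cluster': clusters})

df_story = (df_transactions
            .merge(df_clusters, on='customer_name')
            .merge(df_offers, on='offer_id'))

# varietal mix and average deal terms for each cluster
print(pd.crosstab(df_story['cluster'], df_story['varietal']))
print(df_story.groupby('cluster')[['min_qty', 'discount']].mean())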

Exercise Set V

As we saw earlier, PCA has a lot of other uses. Since we wanted to visualize our data in 2 dimensions, we restricted the number of dimensions to 2 in PCA. But what is the true optimal number of dimensions?

Exercise: Using a new PCA object shown in the next cell, plot the `explained_variance_` field and look for the elbow point, the point where the curve's rate of descent seems to slow sharply. This value is one possible value for the optimal number of dimensions. What is it?


In [17]:
#your turn
# Initialize a new PCA model with a default number of components.
# import sklearn.decomposition
# pca = sklearn.decomposition.PCA()
# pca.fit(X)

# Do the rest on your own :)

In [18]:
pca = PCA()
pca.fit(x_cols)

print(pca.explained_variance_)

plt.plot(range(1, len(pca.explained_variance_)+1), pca.explained_variance_, 'o-')
plt.xlabel('Component')
plt.ylabel('Variance')
plt.title('Variance by Component')
plt.show()


[ 0.40555241  0.30446016  0.20026967  0.1653668   0.14865096  0.14200293
  0.13680698  0.12070372  0.1151981   0.10696228  0.09838435  0.09401002
  0.08603449  0.07184171  0.06543861  0.06183019  0.05578045  0.05274121
  0.04681513  0.04349972  0.03861419  0.03589526  0.03421157  0.0320274
  0.02911226  0.02592039  0.02285085  0.02121206  0.01862586  0.01635995
  0.01411925  0.00770111]

Inspecting the plot above, we see that the elbow falls at around 4 or 5 components.

Other Clustering Algorithms

k-means is only one of a ton of clustering algorithms. Below is a brief description of several clustering algorithms, and the table provides references to the other clustering algorithms in scikit-learn.

  • Affinity Propagation does not require the number of clusters $K$ to be known in advance! AP uses a "message passing" paradigm to cluster points based on their similarity.

  • Spectral Clustering uses the eigenvalues of a similarity matrix to reduce the dimensionality of the data before clustering in a lower dimensional space. This is tangentially similar to what we did to visualize k-means clusters using PCA. The number of clusters must be known a priori.

  • Ward's Method applies to hierarchical clustering. Hierarchical clustering algorithms take a set of data and successively divide the observations into more and more clusters at each layer of the hierarchy. Ward's method is used to determine when two clusters in the hierarchy should be combined into one. It is basically an extension of hierarchical clustering. Hierarchical clustering is divisive, that is, all observations are part of the same cluster at first, and at each successive iteration, the clusters are made smaller and smaller. With hierarchical clustering, a hierarchy is constructed, and there is not really the concept of "number of clusters." The number of clusters simply determines how low or how high in the hierarchy we reference and can be determined empirically or by looking at the dendrogram.

  • Agglomerative Clustering is similar to hierarchical clustering but is not divisive; it is agglomerative. That is, every observation is placed into its own cluster, and at each iteration or level of the hierarchy, observations are merged into fewer and fewer clusters until convergence. Similar to hierarchical clustering, the constructed hierarchy contains all possible numbers of clusters, and it is up to the analyst to pick the number by reviewing statistics or the dendrogram.

  • DBSCAN is based on point density rather than distance. It groups together points with many nearby neighbors. DBSCAN is one of the most cited algorithms in the literature. It does not require knowing the number of clusters a priori, but does require specifying the neighborhood size.

Clustering Algorithms in Scikit-learn

| Method name | Parameters | Scalability | Use Case | Geometry (metric used) |
|---|---|---|---|---|
| K-Means | number of clusters | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters | Distances between points |
| Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Mean-shift | bandwidth | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry | Distances between points |
| Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry | Graph distance (e.g. nearest-neighbor graph) |
| Ward hierarchical clustering | number of clusters | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints | Distances between points |
| Agglomerative clustering | number of clusters, linkage type, distance | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, non-Euclidean distances | Any pairwise distance |
| DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes | Distances between nearest points |
| Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation | Mahalanobis distances to centers |
| Birch | branching factor, threshold, optional global clusterer | Large n_clusters and n_samples | Large dataset, outlier removal, data reduction | Euclidean distance between points |

Source: http://scikit-learn.org/stable/modules/clustering.html

Exercise Set VI

Exercise: Try clustering using the following algorithms.

  1. Affinity propagation
  2. Spectral clustering
  3. Agglomerative clustering
  4. DBSCAN

How do their results compare? Which performs the best? Tell a story why you think it performs the best.


In [19]:
# Your turn

Ultimately, comparing the results of different clustering methods is somewhat subjective, since we do not have ground-truth labels as we would in supervised learning. We can still compare the results by using the silhouette score.


In [20]:
from sklearn.cluster import AffinityPropagation

best_score = 0
print('Affinity Propagation')
for d in np.arange(0.5, 1, 0.05):
    model = AffinityPropagation(damping=d)
    clusters = model.fit_predict(x_cols)

    score = silhouette_score(x_cols, clusters)
    
    if score > best_score:
        best_score = score
        
    print('damping =', d,': Silhouette score =', score)
    
print('\nBest Score =', best_score)


Affinity Propagation
damping = 0.5 : Silhouette score = 0.123465236045
damping = 0.55 : Silhouette score = 0.123465236045
damping = 0.6 : Silhouette score = 0.123465236045
damping = 0.65 : Silhouette score = 0.123465236045
damping = 0.7 : Silhouette score = 0.123465236045
damping = 0.75 : Silhouette score = 0.123465236045
damping = 0.8 : Silhouette score = 0.0895455499989
damping = 0.85 : Silhouette score = 0.0895455499989
damping = 0.9 : Silhouette score = 0.0895455499989
damping = 0.95 : Silhouette score = 0.0895455499989

Best Score = 0.123465236045

In [21]:
from sklearn.cluster import SpectralClustering

best_score = 0
print('Spectral Clustering')
for k in ks:
    model = SpectralClustering(n_clusters=k)
    clusters = model.fit_predict(x_cols)

    score = silhouette_score(x_cols, clusters)
    
    if score > best_score:
        best_score = score
        
    print('n_clusters =', k,': Silhouette score =', score)
    
print('\nBest Score =', best_score)


Spectral Clustering
n_clusters = 2 : Silhouette score = 0.0591405970172
n_clusters = 3 : Silhouette score = 0.0999059123102
n_clusters = 4 : Silhouette score = 0.0290656989328
n_clusters = 5 : Silhouette score = 0.0158567687769
n_clusters = 6 : Silhouette score = 0.0175964169953
n_clusters = 7 : Silhouette score = 0.0545158190649
n_clusters = 8 : Silhouette score = 0.0572454782216
n_clusters = 9 : Silhouette score = 0.0440764882939
n_clusters = 10 : Silhouette score = 0.0386665025459

Best Score = 0.0999059123102

In [22]:
from sklearn.cluster import AgglomerativeClustering

best_score = 0
print('Agglomerative Clustering')
for k in ks:
    model = AgglomerativeClustering(n_clusters=k)
    clusters = model.fit_predict(x_cols)

    score = silhouette_score(x_cols, clusters)
    
    if score > best_score:
        best_score = score
    
    print('n_clusters =', k,': Silhouette score =', score)
    
print('\nBest Score =', best_score)


Agglomerative Clustering
n_clusters = 2 : Silhouette score = 0.0825801782318
n_clusters = 3 : Silhouette score = 0.116258788636
n_clusters = 4 : Silhouette score = 0.128937578159
n_clusters = 5 : Silhouette score = 0.140897399708
n_clusters = 6 : Silhouette score = 0.147152172046
n_clusters = 7 : Silhouette score = 0.152751527511
n_clusters = 8 : Silhouette score = 0.155780537186
n_clusters = 9 : Silhouette score = 0.12075124132
n_clusters = 10 : Silhouette score = 0.0975475074715

Best Score = 0.155780537186

In [23]:
from sklearn.cluster import DBSCAN

best_score = 0
# Grid-search over eps and min_samples. silhouette_score needs at least two
# distinct labels, hence the check on the number of clusters DBSCAN finds.
for e in np.arange(1, 2.6, 0.1):
    for sample in range(1, 11):
        model = DBSCAN(eps=e, min_samples=sample)
        clusters = model.fit_predict(x_cols)

        if len(collections.Counter(clusters).keys()) > 1:
            score = silhouette_score(x_cols, clusters)
            
            if score > best_score:
                best_score = score
            
            print('eps=', e, ', min_samples=', sample, ': Silhouette score =', score)
            
print('\nBest Score =', best_score)


eps= 1.0 , min_samples= 1 : Silhouette score = 0.0487022222466
eps= 1.0 , min_samples= 2 : Silhouette score = 0.0254522297235
eps= 1.0 , min_samples= 3 : Silhouette score = 0.0383161614218
eps= 1.0 , min_samples= 4 : Silhouette score = 0.0417711827655
eps= 1.0 , min_samples= 5 : Silhouette score = 0.0127152032749
eps= 1.0 , min_samples= 6 : Silhouette score = 0.0205286831903
eps= 1.0 , min_samples= 7 : Silhouette score = -0.0128450751348
eps= 1.0 , min_samples= 8 : Silhouette score = 0.00882539854122
eps= 1.0 , min_samples= 9 : Silhouette score = 0.057334735345
eps= 1.0 , min_samples= 10 : Silhouette score = 0.057334735345
eps= 1.1 , min_samples= 1 : Silhouette score = 0.0487022222466
eps= 1.1 , min_samples= 2 : Silhouette score = 0.0254522297235
eps= 1.1 , min_samples= 3 : Silhouette score = 0.0383161614218
eps= 1.1 , min_samples= 4 : Silhouette score = 0.0417711827655
eps= 1.1 , min_samples= 5 : Silhouette score = 0.0127152032749
eps= 1.1 , min_samples= 6 : Silhouette score = 0.0205286831903
eps= 1.1 , min_samples= 7 : Silhouette score = -0.0128450751348
eps= 1.1 , min_samples= 8 : Silhouette score = 0.00882539854122
eps= 1.1 , min_samples= 9 : Silhouette score = 0.057334735345
eps= 1.1 , min_samples= 10 : Silhouette score = 0.057334735345
eps= 1.2 , min_samples= 1 : Silhouette score = 0.0487022222466
eps= 1.2 , min_samples= 2 : Silhouette score = 0.0254522297235
eps= 1.2 , min_samples= 3 : Silhouette score = 0.0383161614218
eps= 1.2 , min_samples= 4 : Silhouette score = 0.0417711827655
eps= 1.2 , min_samples= 5 : Silhouette score = 0.0127152032749
eps= 1.2 , min_samples= 6 : Silhouette score = 0.0205286831903
eps= 1.2 , min_samples= 7 : Silhouette score = -0.0128450751348
eps= 1.2 , min_samples= 8 : Silhouette score = 0.00882539854122
eps= 1.2 , min_samples= 9 : Silhouette score = 0.057334735345
eps= 1.2 , min_samples= 10 : Silhouette score = 0.057334735345
eps= 1.3 , min_samples= 1 : Silhouette score = 0.0487022222466
eps= 1.3 , min_samples= 2 : Silhouette score = 0.0254522297235
eps= 1.3 , min_samples= 3 : Silhouette score = 0.0383161614218
eps= 1.3 , min_samples= 4 : Silhouette score = 0.0417711827655
eps= 1.3 , min_samples= 5 : Silhouette score = 0.0127152032749
eps= 1.3 , min_samples= 6 : Silhouette score = 0.0205286831903
eps= 1.3 , min_samples= 7 : Silhouette score = -0.0128450751348
eps= 1.3 , min_samples= 8 : Silhouette score = 0.00882539854122
eps= 1.3 , min_samples= 9 : Silhouette score = 0.057334735345
eps= 1.3 , min_samples= 10 : Silhouette score = 0.057334735345
eps= 1.4 , min_samples= 1 : Silhouette score = 0.0487022222466
eps= 1.4 , min_samples= 2 : Silhouette score = 0.0254522297235
eps= 1.4 , min_samples= 3 : Silhouette score = 0.0383161614218
eps= 1.4 , min_samples= 4 : Silhouette score = 0.0417711827655
eps= 1.4 , min_samples= 5 : Silhouette score = 0.0127152032749
eps= 1.4 , min_samples= 6 : Silhouette score = 0.0205286831903
eps= 1.4 , min_samples= 7 : Silhouette score = -0.0128450751348
eps= 1.4 , min_samples= 8 : Silhouette score = 0.00882539854122
eps= 1.4 , min_samples= 9 : Silhouette score = 0.057334735345
eps= 1.4 , min_samples= 10 : Silhouette score = 0.057334735345
eps= 1.5 , min_samples= 1 : Silhouette score = 0.0165341682405
eps= 1.5 , min_samples= 2 : Silhouette score = 0.0950039187709
eps= 1.5 , min_samples= 3 : Silhouette score = 0.132124010985
eps= 1.5 , min_samples= 4 : Silhouette score = 0.121754985088
eps= 1.5 , min_samples= 5 : Silhouette score = 0.119898866858
eps= 1.5 , min_samples= 6 : Silhouette score = 0.118135287204
eps= 1.5 , min_samples= 7 : Silhouette score = 0.118135287204
eps= 1.5 , min_samples= 8 : Silhouette score = 0.117007663757
eps= 1.5 , min_samples= 9 : Silhouette score = 0.117007663757
eps= 1.5 , min_samples= 10 : Silhouette score = 0.115395040347
eps= 1.6 , min_samples= 1 : Silhouette score = 0.0165341682405
eps= 1.6 , min_samples= 2 : Silhouette score = 0.0950039187709
eps= 1.6 , min_samples= 3 : Silhouette score = 0.132124010985
eps= 1.6 , min_samples= 4 : Silhouette score = 0.121754985088
eps= 1.6 , min_samples= 5 : Silhouette score = 0.119898866858
eps= 1.6 , min_samples= 6 : Silhouette score = 0.118135287204
eps= 1.6 , min_samples= 7 : Silhouette score = 0.118135287204
eps= 1.6 , min_samples= 8 : Silhouette score = 0.117007663757
eps= 1.6 , min_samples= 9 : Silhouette score = 0.117007663757
eps= 1.6 , min_samples= 10 : Silhouette score = 0.115395040347
eps= 1.7 , min_samples= 1 : Silhouette score = 0.0165341682405
eps= 1.7 , min_samples= 2 : Silhouette score = 0.0950039187709
eps= 1.7 , min_samples= 3 : Silhouette score = 0.132124010985
eps= 1.7 , min_samples= 4 : Silhouette score = 0.121754985088
eps= 1.7 , min_samples= 5 : Silhouette score = 0.119898866858
eps= 1.7 , min_samples= 6 : Silhouette score = 0.118135287204
eps= 1.7 , min_samples= 7 : Silhouette score = 0.118135287204
eps= 1.7 , min_samples= 8 : Silhouette score = 0.117007663757
eps= 1.7 , min_samples= 9 : Silhouette score = 0.117007663757
eps= 1.7 , min_samples= 10 : Silhouette score = 0.115395040347
eps= 1.8 , min_samples= 1 : Silhouette score = 0.048929146662
eps= 1.8 , min_samples= 2 : Silhouette score = 0.18020378233
eps= 1.8 , min_samples= 3 : Silhouette score = 0.18020378233
eps= 1.8 , min_samples= 4 : Silhouette score = 0.18020378233
eps= 1.8 , min_samples= 5 : Silhouette score = 0.18020378233
eps= 1.8 , min_samples= 6 : Silhouette score = 0.175921148074
eps= 1.8 , min_samples= 7 : Silhouette score = 0.175978738794
eps= 1.8 , min_samples= 8 : Silhouette score = 0.16845753602
eps= 1.8 , min_samples= 9 : Silhouette score = 0.16845753602
eps= 1.8 , min_samples= 10 : Silhouette score = 0.16845753602
eps= 1.9 , min_samples= 1 : Silhouette score = 0.048929146662
eps= 1.9 , min_samples= 2 : Silhouette score = 0.18020378233
eps= 1.9 , min_samples= 3 : Silhouette score = 0.18020378233
eps= 1.9 , min_samples= 4 : Silhouette score = 0.18020378233
eps= 1.9 , min_samples= 5 : Silhouette score = 0.18020378233
eps= 1.9 , min_samples= 6 : Silhouette score = 0.175921148074
eps= 1.9 , min_samples= 7 : Silhouette score = 0.175978738794
eps= 1.9 , min_samples= 8 : Silhouette score = 0.16845753602
eps= 1.9 , min_samples= 9 : Silhouette score = 0.16845753602
eps= 1.9 , min_samples= 10 : Silhouette score = 0.16845753602
eps= 2.0 , min_samples= 1 : Silhouette score = 0.164688372397
eps= 2.0 , min_samples= 2 : Silhouette score = 0.234889659521
eps= 2.0 , min_samples= 3 : Silhouette score = 0.234889659521
eps= 2.0 , min_samples= 4 : Silhouette score = 0.234889659521
eps= 2.0 , min_samples= 5 : Silhouette score = 0.234889659521
eps= 2.0 , min_samples= 6 : Silhouette score = 0.234889659521
eps= 2.0 , min_samples= 7 : Silhouette score = 0.234889659521
eps= 2.0 , min_samples= 8 : Silhouette score = 0.234889659521
eps= 2.0 , min_samples= 9 : Silhouette score = 0.234889659521
eps= 2.0 , min_samples= 10 : Silhouette score = 0.234889659521
eps= 2.1 , min_samples= 1 : Silhouette score = 0.164688372397
eps= 2.1 , min_samples= 2 : Silhouette score = 0.234889659521
eps= 2.1 , min_samples= 3 : Silhouette score = 0.234889659521
eps= 2.1 , min_samples= 4 : Silhouette score = 0.234889659521
eps= 2.1 , min_samples= 5 : Silhouette score = 0.234889659521
eps= 2.1 , min_samples= 6 : Silhouette score = 0.234889659521
eps= 2.1 , min_samples= 7 : Silhouette score = 0.234889659521
eps= 2.1 , min_samples= 8 : Silhouette score = 0.234889659521
eps= 2.1 , min_samples= 9 : Silhouette score = 0.234889659521
eps= 2.1 , min_samples= 10 : Silhouette score = 0.234889659521
eps= 2.2 , min_samples= 1 : Silhouette score = 0.164688372397
eps= 2.2 , min_samples= 2 : Silhouette score = 0.234889659521
eps= 2.2 , min_samples= 3 : Silhouette score = 0.234889659521
eps= 2.2 , min_samples= 4 : Silhouette score = 0.234889659521
eps= 2.2 , min_samples= 5 : Silhouette score = 0.234889659521
eps= 2.2 , min_samples= 6 : Silhouette score = 0.234889659521
eps= 2.2 , min_samples= 7 : Silhouette score = 0.234889659521
eps= 2.2 , min_samples= 8 : Silhouette score = 0.234889659521
eps= 2.2 , min_samples= 9 : Silhouette score = 0.234889659521
eps= 2.2 , min_samples= 10 : Silhouette score = 0.234889659521
eps= 2.3 , min_samples= 7 : Silhouette score = 0.279222772917
eps= 2.3 , min_samples= 8 : Silhouette score = 0.279222772917
eps= 2.3 , min_samples= 9 : Silhouette score = 0.279222772917
eps= 2.3 , min_samples= 10 : Silhouette score = 0.279222772917
eps= 2.4 , min_samples= 7 : Silhouette score = 0.279222772917
eps= 2.4 , min_samples= 8 : Silhouette score = 0.279222772917
eps= 2.4 , min_samples= 9 : Silhouette score = 0.279222772917
eps= 2.4 , min_samples= 10 : Silhouette score = 0.279222772917

Best Score = 0.279222772917

After tuning each clustering model, the best result was obtained with DBSCAN, with a silhouette score of 0.279. Even that score falls in the "weak and could be artificial" range of the interpretation table above, so the cluster structure in this dataset appears modest at best.